Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-balance roles deterministically instead of randomly #209

Merged
merged 6 commits into from
Apr 5, 2024

Conversation

carlcsaposs-canonical
Copy link
Contributor

@carlcsaposs-canonical carlcsaposs-canonical commented Mar 25, 2024

(for auto-generated roles)

Algorithm:

  • if odd # of nodes: all nodes are cluster manager eligible
  • if even # of nodes: node with highest unit number is data node, all other nodes are cluster manager eligible

Replaces old algorithm of random.choice one of the nodes that should change roles

This will enable in-place upgrades to be ordered from highest to lowest unit number even if roles are re-balanced (for HA [e.g. because of network cut]) while upgrade is in-progress

That allows upgrade/rollback to be coordinated without leader unit setting information in peer databag, which allows partial rollback even if leader unit is in error state

Context:
https://chat.canonical.com/canonical/pl/zpqoyei54ffr5kwgwkbimizhwo
https://warthogs.atlassian.net/browse/DPE-3878


Also, this algorithm ensures that if a unit needs to be restarted for HA while upgrade in-progress, a unit (the highest number unit) on the newer version of the charm/workload will be upgraded. This reduces the likelihood that:

which would mean that units on the old version could not re-connect to the cluster

@carlcsaposs-canonical carlcsaposs-canonical marked this pull request as ready for review March 25, 2024 12:56
Copy link
Contributor

@phvalguima phvalguima left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice refactoring proposal!

lib/charms/opensearch/v0/helper_cluster.py Show resolved Hide resolved
phvalguima
phvalguima previously approved these changes Mar 26, 2024
Copy link
Contributor

@phvalguima phvalguima left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@Mehdi-Bendriss Mehdi-Bendriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Carl! Great work! I have some comments regarding the logic we use to recompute the roles based on the count of only the nodes of the current cluster, whereas we should be based on the entirety of the fleet

lib/charms/opensearch/v0/helper_cluster.py Show resolved Hide resolved
lib/charms/opensearch/v0/helper_cluster.py Show resolved Hide resolved
lib/charms/opensearch/v0/helper_cluster.py Show resolved Hide resolved
(for auto-generated roles)

Algorithm:
- if odd # of nodes: all nodes are cluster manager eligible
- if even # of nodes: node with highest unit number is data node, all other nodes are cluster manager eligible

Replaces old algorithm of `random.choice` one of the nodes that should change roles

This will enable in-place upgrades to be ordered from highest to lowest unit number even if roles are re-balanced (for HA [e.g. because of network cut]) while upgrade is in-progress

That allows upgrade/rollback to be coordinated without leader unit setting information in peer databag, which allows partial rollback even if leader unit is in error state

---

Also, this algorithm ensures that if a unit needs to be restarted for HA while upgrade in-progress, a unit (the highest number unit) on the newer version of the charm/workload will be upgraded. This reduces the likelihood that:
- one of the last units on the older version restarts and
- the cluster manager switches to operating with the new workload version (i.e. it stops operating in compatability mode with the old version—see https://discuss.elastic.co/t/rolling-upgrades-master-nodes-voting-config-exclusions/320463/2)
which means that units on the old version cannot re-connect to the cluster
Highest unit number cannot always be determined (e.g. if unit hasn't joined peer relation). Unable to use `planned_units` since unit numbers are not necessarily sequential
max([]) raises ValueError
Copy link
Contributor

@Mehdi-Bendriss Mehdi-Bendriss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Carl!

@carlcsaposs-canonical carlcsaposs-canonical merged commit 4d27728 into main Apr 5, 2024
21 of 23 checks passed
@carlcsaposs-canonical carlcsaposs-canonical deleted the deterministic-role-re-balancing branch April 5, 2024 07:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants